Abstract

A parallel program can be represented as a directed acyclic graph. An important
performance bound is the time to execute the critical path through
the graph. We show how this performance metric is related to Amdahl speedup 
and the degree of average parallelism. These bounds formally exclude superlinear 
performance. 

1 Computational DAG 

A parallel program can be represented as a directed acyclic graph (DAG), where nodes
correspond to tasks (or subtasks) and arrows represent control or communication between
tasks. Leiserson [ ] characterizes the performance of parallel programs by the
elapsed time $T_1$ to execute all the nodes in a DAG (e.g., Fig. 1), and the time $T_\infty$ to
execute the critical path.




Figure 1: Critical path (orange) in a parallel task graph [1]
In project management, a critical path is the sequence of project network activities
(e.g., in a PERT chart) that adds up to the longest overall duration. It determines the
shortest possible time to complete the project. Any delay of an activity on the critical
path directly impacts the planned project completion date (i.e., there is no float on the
critical path). A project can have more than one critical path.

The time to execute a program on $p$ processors is $T_p$ and the speedup metric is:

$$ S_p = \frac{T_1}{T_p} \qquad (1) $$



with computational efficiency:

$$ E_p = \frac{S_p}{p} \qquad (2) $$

i.e., the average amount of speedup per processor.
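As a minimal sketch of definitions (1) and (2), the two metrics can be computed directly from measured run times. The function names and the sample timings (18 time units on one processor, 6 on four) are illustrative assumptions, not values taken from the text.

```python
def speedup(t1: float, tp: float) -> float:
    """Speedup S_p = T_1 / T_p, as in (1)."""
    if tp <= 0:
        raise ValueError("tp must be positive")
    return t1 / tp


def efficiency(t1: float, tp: float, p: int) -> float:
    """Efficiency E_p = S_p / p, as in (2)."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return speedup(t1, tp) / p


# Hypothetical timings: T_1 = 18 and T_4 = 6 time units.
s4 = speedup(18.0, 6.0)
e4 = efficiency(18.0, 6.0, 4)
print(s4, e4)  # 3.0 0.75
```

With these assumed timings, four processors deliver a speedup of 3, so each processor contributes 0.75 of an ideal unit of speedup.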



2 Performance Bounds 

Leiserson [ ] claims there are two lower bounds on parallel performance for Fig. 1:

$$ T_p \ge T_1 / p \qquad (3) $$
$$ T_p \ge T_\infty \qquad (4) $$

$T_1/p$ is the reduced execution time attained by partitioning the work (equally) across
$p$ processors. Clearly, $T_p$ cannot be less than the time it takes to execute a $p$-th of
the work, which is the meaning of (3). Similarly, $T_p$ cannot be less than the time it takes to
execute the critical path, even if there are an infinite number of physical processors, which is the
meaning of (4).
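The two bounds can be combined into a single lower bound on $T_p$, namely the larger of $T_1/p$ and $T_\infty$. The sketch below assumes illustrative values $T_1 = 18$ and $T_\infty = 9$; the function name is hypothetical.

```python
def lower_bound_tp(t1: float, t_inf: float, p: int) -> float:
    """Larger of the two lower bounds on T_p: T_1/p from (3), T_inf from (4)."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return max(t1 / p, t_inf)


# Assumed values: total work T_1 = 18, critical-path time T_inf = 9.
T1, T_INF = 18.0, 9.0
for p in (1, 2, 4, 18):
    print(p, lower_bound_tp(T1, T_INF, p))
# 1 18.0
# 2 9.0
# 4 9.0
# 18 9.0
```

For small $p$ the work bound (3) dominates; once $T_1/p$ drops below $T_\infty$, the critical-path bound (4) takes over and adding processors no longer lowers the bound.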

Substituting the equality case of (3) into (1):

$$ S_p = \frac{T_1}{T_1 / p} = p \qquad (5) $$

which corresponds to ideal linear speedup. In reality, we expect the speedup to be 
generally sublinear: 

$$ S_p < p \qquad (6) $$

Under certain special circumstances speedup may exhibit superlinear performance:

$$ S_p > p \qquad (7) $$

Leiserson excludes (7) on the basis of (3). He also states that because of (4), the 
maximum possible speedup is given by: 

$$ S_\infty = \frac{T_1}{T_\infty} \qquad (8) $$

He calls (8) the "parallelism" and it corresponds to the average amount of work-per-node
along the critical path. But what do these bounds really mean?
2.1 Example

Consider an example based on Fig. 1. 

Example 1 Following Leiserson, let's assume for simplicity that each node in the DAG
takes just 1 unit of time to execute. Then, the total time to execute the entire DAG on
a single processor is $T_1 = 18$ time steps.

Similarly, the critical path contains nine nodes, so $T_\infty = 9$ and from (8):

$$ S_\infty = \frac{18}{9} = 2 \qquad (9) $$

Hence, the maximum possible speedup is 2.
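The same numbers can be recovered mechanically from a DAG by summing node costs for $T_1$ and taking the longest path for $T_\infty$. The sketch below is only an illustration: the level widths (1, 1, 1, 3, 4, 4, 2, 1, 1) and, in particular, the edge set (every node depends on all nodes in the level above) are assumptions, since the actual edges of Fig. 1 are not reproduced here.

```python
from graphlib import TopologicalSorter

# Assumed number of nodes on each level (18 nodes on 9 levels).
WIDTHS = [1, 1, 1, 3, 4, 4, 2, 1, 1]


def build_layered_dag(widths):
    """Return {node: set(predecessors)} for a hypothetical layered DAG in
    which every node depends on all nodes in the level directly above."""
    levels, next_id = [], 0
    for w in widths:
        levels.append(list(range(next_id, next_id + w)))
        next_id += w
    preds = {n: set() for n in levels[0]}
    for above, below in zip(levels, levels[1:]):
        for n in below:
            preds[n] = set(above)
    return preds


def work_and_span(preds, cost=1):
    """T_1 is the total node cost; T_inf is the cost of the longest path."""
    finish = {}
    # static_order() yields each node after all of its predecessors.
    for node in TopologicalSorter(preds).static_order():
        finish[node] = cost + max((finish[q] for q in preds[node]), default=0)
    return cost * len(preds), max(finish.values())


t1, t_inf = work_and_span(build_layered_dag(WIDTHS))
print(t1, t_inf, t1 / t_inf)  # 18 9 2.0
```

Any edge set that keeps the same nine levels gives the same $T_\infty$, because with unit node costs the longest path must visit one node per level.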



Note, however, that this maximum speedup (8) is not the same as the more familiar
Amdahl bound [3, 4, 5]:

$$ S_\infty^{\text{Amdahl}} = \frac{1}{\sigma} \qquad (10) $$

Equation (10) is the asymptotic form of the Amdahl speedup function [3]:

$$ S_p^{\text{Amdahl}} = \frac{p}{1 + \sigma (p - 1)} \qquad (11) $$

in the limit of an infinite number of processors $p \to \infty$. In Fig. 1, the serial fraction
($\sigma$) corresponds to 4 single nodes out of 18 total nodes and therefore:

$$ S_\infty^{\text{Amdahl}} = \frac{18}{4} = 4.5 \qquad (12) $$

which is numerically greater than the "maximum" in (9). 
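To see how the Amdahl function approaches its asymptote, the following sketch evaluates it for a serial fraction of 4/18 and compares the limit with the critical-path value of 2. The function names are illustrative assumptions.

```python
def amdahl_speedup(p: int, sigma: float) -> float:
    """Amdahl speedup p / (1 + sigma * (p - 1)) for serial fraction sigma."""
    return p / (1 + sigma * (p - 1))


def amdahl_limit(sigma: float) -> float:
    """Asymptotic Amdahl speedup 1 / sigma as p grows without bound."""
    return 1 / sigma


SIGMA = 4 / 18  # 4 serial nodes out of 18, as stated in the text

for p in (1, 2, 18, 1000):
    print(p, round(amdahl_speedup(p, SIGMA), 3))

# The asymptote is approximately 4.5, above the critical-path value of 2.
print(round(amdahl_limit(SIGMA), 6))
```

The speedup starts at 1 for a single processor and rises monotonically toward roughly 4.5 without reaching it.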



3 Reconciliation 



How can we reconcile these various algorithmic speedup metrics? 



3.1 Average Parallelism 

Theorem 1 Leiserson's $S_\infty$ is identical to the average parallelism [ , ] defined as:

$$ A = \frac{W}{T} \qquad (13) $$

where W is the total amount of work (expressed in cpu-seconds, for example) and T is 
the total parallel execution time. 
Proof 1 Calculate W using the following procedure:

1. Start at the top of the DAG 

2. At each level where there are nodes, draw a horizontal line through them 

3. On each horizontal row calculate the time-node product for each node 

4. Sum all the time-node products on each row to get W 

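which can be written symbolically as:

$$ W = \sum_{i=1}^{\text{depth}} t_i \times n_i \qquad (14) $$

where $t_i$ is the execution time of a node on row $i$ and $n_i$ is the number of nodes on that row.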

For Fig. 1 we obtain: 

$$ W = (1 \times 1) + (1 \times 1) + (1 \times 1) + (1 \times 3) + (1 \times 4) + (1 \times 4) + (1 \times 2) + (1 \times 1) + (1 \times 1) \qquad (15) $$

or $W = 18$. The value of $T$ can be obtained by simply adding together all the time
factors in the products of (15), i.e., $T = 9$, since $t_i = 1$ and there are nine terms.
Applying (13): $A = 18/9 = 2$, which is identical to (9). Thus, $S_\infty = A$. ■
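The row-by-row procedure in the proof can be sketched in a few lines. The row widths and unit times are those appearing in the products for Fig. 1; the function name is an illustrative assumption.

```python
def average_parallelism(times, widths):
    """Return (W, T, A): W sums the time-node products over rows,
    T sums the per-row times, and A = W / T."""
    if len(times) != len(widths):
        raise ValueError("times and widths must have one entry per row")
    work = sum(t * n for t, n in zip(times, widths))
    elapsed = sum(times)
    return work, elapsed, work / elapsed


WIDTHS = [1, 1, 1, 3, 4, 4, 2, 1, 1]  # nodes on each row of Fig. 1
TIMES = [1] * len(WIDTHS)             # one time unit per node

print(average_parallelism(TIMES, WIDTHS))  # (18, 9, 2.0)
```

Because the per-row times are kept separate from the widths, the same function also covers rows whose nodes take more than one time unit, provided all nodes on a row take the same time.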

Remark 1 This is consistent with Leiserson's definition of $S_\infty$ as the average amount
of work-per-node along the critical path. See Section 2. Other examples of calculating
average parallelism are presented in Ref. [6].

3.2 Superlinear Performance

Finally, we can see how Leiserson excludes superlinear performance on the basis of 
bound (3). 

Theorem 2 The bound (3) is equivalent to any computational DAG compressed to 
depth one. 

Proof 2 In Fig. 1, such node compression is equivalent to having all 18 nodes positioned
on the same horizontal row. Since it is not possible to squash the DAG any flatter, the
best possible speedup corresponds to distributing those 18 nodes simultaneously onto
$p = 18$ processors. This bound is identical to ideal linear speedup (5), i.e., $S_p = 18$. ■
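The flattened case can be checked numerically. The sketch below assumes a simple scheduling model that is not stated in the text: the 18 independent unit-time nodes are executed in rounds of at most $p$ nodes each, so the run time is the number of rounds. The function name is hypothetical.

```python
import math


def flattened_speedup(n_nodes: int, p: int, t: float = 1.0) -> float:
    """Speedup for a depth-one DAG of n_nodes independent tasks, each of
    duration t, run in rounds of at most p tasks (assumed model)."""
    t1 = n_nodes * t                  # serial time
    tp = math.ceil(n_nodes / p) * t   # one round per batch of p tasks
    return t1 / tp


N = 18
speedups = {p: flattened_speedup(N, p) for p in range(1, N + 1)}

# Under this model the speedup never exceeds p and reaches 18 at p = 18.
print(all(s <= p for p, s in speedups.items()))  # True
print(speedups[18])                               # 18.0
```

When $p$ divides 18 evenly the speedup equals $p$; otherwise the last round is partly idle and the speedup falls below $p$.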

Corollary 1 From (2), linear speedup corresponds to an efficiency $E_p = 18/18 = 1$.
Superlinear speedup (7) corresponds to an efficiency $E_p > 1$. One way this might
be observed is to run the work in Fig. 1 successively on $p = 1, 2, 3, \ldots$ processors. The
speedup for small $p$ would be inferior to that for large $p$, so the scaling would appear
to become better than linear. However, this apparent improvement is just an artifact
of choosing the wrong baseline to establish linearity in the first place.